Vast Challenge MC2 II

This blog is for the Vast Challenge 2021, Mini Challenge 2 Q3, Q4, and Q5.

Linli Zhong https://github.com/LinliZhong
07-23-2021

Note: This is the solution for Q4 and Q5 of Vast Challenge MC2. To view solution for Q1, Q2 and Q3, please click here(https://linlizhong.netlify.app/posts/2021-07-25-assignment-mc2/)

1. Overview

1.1 Background

GAStech is a natural gas production company in an island country of Kronos in the past twenty years. It builds strong relationships with the government of Kronos but low environment stewardship. In January 2014, when celebrating the success fortune by IPO (initial public offering), several employees of GAStech went missing.

This post is focused on VAST Challenge 2021 Mini-Challenge 2. Mini-Challenge 2 provides useful data of missing employees’ relevant information, a list of vehicle assignments by employee, vehicle tracking data, loyalty card transaction data, credit and debit card transaction data, and a tourist map of Abila. In this post, I will make use of R language to visual and analyze available data and identify the suspicious activity of the employees.

1.2 Literature review

The task of Mini Challenge 2 in 2014 was to find unusual patterns in employees’ daily lives from their credit/debit/loyalty card records and the GPS tracking records of their cars. In this article, I mainly refer to two works by Central South University and KU Leuven.

For Central South University, they used D3, MySQL, Excel as their visualization tools. They first identified daily routines for GAStech employees. They divided these missing employees into general staffs and truck drivers. Then they discuessed these two groups separately. They detected suspicious activities by visualization the credit card and gps data.

For KU Leuven, they had three stages to visualize. (1) Aggregate & Slice, (2) Design, Filter & Analyze and (3) Communicate. They first made full use of R to test their hypothesis. Then involved the design and implementation of streamlined tools to optimize the identification of specific patterns. After that, they presented and visualized the results.

2. Data Preparation

2.1 Install and load packages

The code chunk below is to check required packages are installed or not, if they are not installed, the code chunk will help install them automatically. After all packages are installed, the code chunk will load them.

packages = c('igraph','ggraph','visNetwork','lubridate','dplyr',
             'clock','tidyverse','DataExplorer','gplots','plotly',
             'ggplot2','superheat','tm','plotly','lattice','GDAdata','raster',
             'sf','tmap','foreign','gifski','rgdal','tiff','sp','leaflet',"DT"
             )
for(p in packages){
  if(!require(p, character.only = T)){
  install.packages(p)
  }
  library(p,character.only = T)}

2.2 Import relevant data

The data is stored in MC2 file. Read_csv()is to import four csv files into R. “Windows-1252” is a single-byte character encoding of the Latin alphabet for value “Katerina’s Café”.

cc<- read_csv("MC2/cc_data.csv",locale = locale(encoding = "windows-1252"))
lyt<-read_csv("MC2/loyalty_data.csv",locale = locale(encoding = "windows-1252"))
gps <-read_csv("MC2/gps.csv")
car<-read_csv("MC2/car-assignments.csv")

2.3 Data preparation

1. Change data type. The code chunk below is to convert id in gps from number to factor, last4ccnum in cc from number to character and timestamp in gps, loyalty, and credit card from character data type to Datetime format.

gps$id<-as.factor(gps$id)
car$CarID<-as.factor(car$CarID)
cc$last4ccnum<-as.character(cc$last4ccnum)
gps$Timestamp<-date_time_parse(gps$Timestamp,zone = "",format = "%m/%d/%Y %H:%M:%S")
lyt$timestamp<-date_time_parse(lyt$timestamp,zone = "",format = "%m/%d/%Y")
cc$timestamp<-date_time_parse(cc$timestamp,zone = "",format = "%m/%d/%Y %H:%M")

2. Combine data. The code chunk below is to combine first name and last name of employee together in the list of car.

car$Name<-paste(car$FirstName,car$LastName)

3. Add additional column. When analyzing the car assignment csv file, we may notice that the GAStech does not assign specific company truck for each truck driver, but this is not a case for other employees with title which is not truck driver. Hence, we can divide missing employees into two types, one is general employees, another one is truck driver, to further investigate their suspicious activities. Besides, add additional column for car type, and there are two types of cars, one for company truck another one for company car.

car$Category<-ifelse(car$CurrentEmploymentTitle == "Truck Driver", 
                       "Truck Driver","General Staff")
gps$car_type<-ifelse(gps$id %in% c('101','104','105','106','107'),'Company truck',
                     'company car')

The code chunk below is to extract day of the timestamp and add it in column called “day” in each file.

cc$day<-get_day(cc$timestamp)
lyt$day<-get_day(lyt$timestamp)
gps$day<-get_day(gps$Timestamp)

The code chunk below is the first part of vast challenge.

cc4<-cc
cc4$timestamp<-format(cc$timestamp,format="%Y-%m-%d")
lyt4<-lyt
lyt4$timestamp<-format(lyt4$timestamp,format="%Y-%m-%d")
cc_lyt<-merge(x=cc4,y=lyt4,all.x = TRUE)
x <-if_else(is.na(cc_lyt$loyaltynum),"No","Yes")
cc_lyt$use_loyalty<-x

suspicious_cc<- c("9551", "8332", "3484","5407","8156")
sus_day <- c(12,13,19)
cc_lyt<-cc_lyt%>%
  filter(day %in% sus_day, last4ccnum %in% suspicious_cc, 
         location == "Kronos Mart")
DT::datatable(cc_lyt)
#Import raster file
bgmap <- raster("MC2/Geospatial/MC2-tourist.tif")
# import the Abila shapefile
Abila_layout <- "MC2/Geospatial/Abila.shp"
Abila_shape <- readOGR(Abila_layout)
OGR data source with driver: ESRI Shapefile 
Source: "E:\LinliZhong\DataViz_blog\_posts\2021-07-25-mc2-part-2\MC2\Geospatial\Abila.shp", layer: "Abila"
with 3290 features
It has 9 fields
Integer64 fields read as strings:  TLID 
#Plot raster layer
tm_shape(bgmap) +
tm_rgb(bgmap, r = 1,g = 2,b = 3,alpha = NA,
       saturation = 1,interpolate = TRUE, max.value = 255)+
  tm_shape(Abila_shape)+
  tm_lines(col= "grey", lwd=1)
#Import GIS data file
Abila_st <- st_read(dsn = "MC2/Geospatial",
                    layer = "Abila")
Reading layer `Abila' from data source 
  `E:\LinliZhong\DataViz_blog\_posts\2021-07-25-mc2-part-2\MC2\Geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 3290 features and 9 fields
Geometry type: LINESTRING
Dimension:     XY
Bounding box:  xmin: 24.82401 ymin: 36.04502 xmax: 24.90997 ymax: 36.09492
Geodetic CRS:  WGS 84
#Converting aspatial data into a simple feature data frame
gps_sf <- st_as_sf(gps, coords = c("long", "lat"),crs= 4326)
#Creating movement path from GPS points
gps_path <- gps_sf %>%
  group_by(id, day) %>%
  summarize(m = mean(Timestamp), do_union=FALSE) %>%
  st_cast("LINESTRING")

3.4 Question 4

Given the data sources provided, identify potential informal or unofficial relationships among GASTech personnel. Provide evidence for these relationships. Please limit your response to 8 images and 500 words.

gps_track <- read_csv("MC2/gps.csv")
gps_track$id<-as.factor(gps$id)
gps_track$Timestamp<-date_time_parse(gps_track$Timestamp,zone = "",
                                     format = "%m/%d/%Y %H:%M:%S")
gps5<- gps_track
gps5<- dplyr::rename(gps5, arrival_time = Timestamp)
gps5$date<-as.Date(format(gps5$arrival_time, "%Y-%m-%d"))
gps5$day_of_week<-wday(gps5$arrival_time, label=TRUE)
gps5 <- gps5[order(gps5$id,gps5$date,gps5$arrival_time), ]
gps5 <- gps5 %>%
  group_by(id) %>%
  mutate(departure_time = lead(arrival_time))%>%
  mutate(time_difference = departure_time - arrival_time)#%>%
gps5<-gps5%>%
  filter(time_difference >= 300 & time_difference <=18000)
gps5$arrival_time<-format(as.POSIXct(gps5$arrival_time), format = "%H:%M:%S")
gps5$departure_time<-format(as.POSIXct(gps5$departure_time), format = "%H:%M:%S")
gps6<-gps5%>%
  filter(id != "107"& id != "106"&id != "105"&
         id != "104"&id != "101")
gps6<-merge(x = gps6, y = car[ , c("Name", "CarID")], 
            by.x='id', by.y='CarID')
gps_sf <- st_as_sf(gps6, coords = c("long", "lat"),crs= 4326)
#Creating movement path from GPS points
gps_path <- gps_sf %>%
  group_by(Name, id, date,day_of_week,time_difference,arrival_time,departure_time)
gps_path_selected2 <- gps_path %>%
  filter(date == "2014-01-11")
tmap_mode("view")
tm_shape(bgmap) +
  tm_rgb(bgmap, r = 1,g = 2,b = 3,alpha = NA,saturation = 1,
       interpolate = TRUE,max.value = 255) +
  tm_shape(Abila_shape)+
  tm_lines(col= "grey", lwd=1)+
  tm_shape(gps_path_selected2)+
  tm_dots() #+tm_facets(by = "date",ncol = 2)